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Abstract 

We consider the high-dimensional heteroscedastic regression model, 
where the mean and the log variance are modeled as a linear combination 
of input variables. Existing literature on high-dimensional linear regres- 
sion models has largely ignored non-constant error variances, even though 
they commonly occur in a variety of applications ranging from biostatis- 
tics to finance. In this paper we study a class of non-convex penalized 
pseudolikelihood estimators for both the mean and variance parameters. 
We show that the Heteroscedastic Iterative Penalized Pseudolikelihood 
Optimizer (HIPPO) achieves the oracle property, that is, we prove that 
the rates of convergence are the same as if the true model was known. We 
demonstrate numerical properties of the procedure on a simulation study 
and real world data. 



1 Introduction 

High-dimensional regression models have been studied extensively in both ma- 
chine learning and statistical literature. Statistical inference in high-dimensions, 
where the sample size n is smaller than the ambient dimension p, is impossible 
without assumptions. As the concept of parsimony is important in many sci- 
entific domains, most of the research in the area of high-dimensional statistical 
inference is done under the assumption that the underlying model is sparse, in 
the sense that the number of relevant parameters is much smaller than p, or 
that it can be well approximated by a sparse model. 

Penalization of the empirical loss by the £i norm has become a popular tool 
for obtaining sparse models and huge amount of literature exists on theoretical 
properties of estin i ation procedures [see, e.g . , I Zhao and IWainwrightl 
20091 Zhang . 20091 Zhang and Huang . 2008 . and refer ences therein] and on ef- 



ficient algorithms that numerically find estimates [see I Bach et al.l . l201ll for an 



extensive literature review]. Due to limitations of the £i norm penalization, 
high-dimensional inference methods based on the class of concave penalties have 
been proposed that have better theoretical and numerical properties [see, e.g. . 



Fan and Lil200lllFan and LvLl2009llLv and Fanl . l2009llZhang and Zhang| . l2011 | 



'Authors listed alphabetically. 
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In all of the above cited work, the main focus is on model selection and 
mean parameter estimati on. Only few papers deal with estimati on of the vari- 
ance in high-dimensions Sun and Zhang . 2011 , Fan et al. . 2012j although it is 
a fundamental problem in statistics. Variance appears in the confidence bounds 
on estimated regression coefficients and is important for variable selection as it 
appears in Akaike's information criterion (AIC) and the Bayesian information 
criterion (BIG). Furthermore, it provides confidence on the predictive perfor- 
mance of a forecaster. 

In applied regression it is often the case that the error variance is non- 
constant. Although the assumption of a constant variance can sometimes be 
achieved by transforming the dependent variable, e.g., by using a Box-Cox 
transfor mation, in many cases transf ormation does not produce a constant error 



variance [Carroll and Ruppertl . |1988| . Another approach is to ignore the hetero- 
geneous variance and use standard estimation techniques, but such estimators 
are less efficient. Aside from its use in reweighting schemes, estimating variance 
is important because the resulting prediction intervals become more accurate 
and it is often important to explore which input variables drive the variance. In 
this paper, we will model the variance directly as a parametric function of the 
explanatory variables. 

Heteroscedastic regression models are used in a variety of fields ranging from 
biostatistics to econometrics, finance and quality control in manufacturing. In 
this paper, we study penalized estimation in high-dimensional heteroscedastic 
linear regression models, where the mean and the log variance are modeled as a 
linear combination of explanatory variables. Modeling the log variance as a lin- 
ear combination of the explanatory variables is a common choice as it guarantees 
positivity and is also capable of capturing varian c e that may va ry over several 
orders of magnitudes [Carroll and Ruppert (l988| . Harvey 1976 [. Main contri- 
butions of this paper are as follows. First, we propose HIPPO (Heteroscedastic 
Iterative Penalized Pseudolikelihood Optimizer) for estimation of both the mean 
a nd variance parame ters. Second, we establish the oracle property (in the sense 
of lFan and Lvl [2009| |) for the estimated mean and variance parameters. Finally, 
we demonstrate numerical properties of the proposed procedure on a simulation 
study, where it is shown that HIPPO outperforms other methods, and analyze 
a real data set. 



1.1 Problem Setup and Notation 

Consider the usual heteroscedastic linear model, 

=x-/3 + (7(xi,0)e,;, j = l,...,n, (1) 

where X = (xi, . . . , x„)' = (Xi, . . . , Xp) is an n x p matrix of predictors with 
i.i.d. rows xi, . . . , x„, y — (jji, . . . , y„) is an n-vector of responses, the vectors 
P G W and 6 E M.P are p- vectors of mean and variance parameters, respectively, 
and e = (ei, . . . e„) is an n-vector of i.i.d. random noise with mean and variance 
1. We assume that the noise e is independent of the predictors X. The function 
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(t(x, 9) has a known parametric form and, for simplicity of presentation, we 
assume that it takes a particular form cr(xi, 0) = exp(x-0/2). 

Throughout the paper we use [n] to denote the set {1, . . . , n}. For any index 
set S C [p], we denote /3s to be the subvector containing the components of 
the vector /3 indexed by the set S, and X5 denotes the submatrix containing 
the columns of X indexed by S. For a vector a e M", we denote supp(a) = 
{j ■ o,j ^ 0} the support set, ||a||g, q e (0,cx)), the f^-norm defined as ||a||g = 
(X)ie[n] ^'ly^'^ '^^^^^ usual extensions for q g {0, 00}, that is, ||a||o = |supp(a)| 
and ||a||oo = maxjg[„] \ai\. For notational simplicity, we denote || • || = || • II2 the 
£2 norm. For a matrix A G R"^^ we denote |||A|||2 the operator norm, ||A||i? 
the Frobenius norm, and Aniin(A) and Aniax(A) denote the smallest and largest 
eigenvalue respectively. 

Under the model in ([T]), we are interested in estimating both /3 and 0. In 
high-dimensions, when p ^ n, it is common to assume that the support /3 is 
small, that is, S = supp(/3) and \S\ <C n. Similarly, we assume that the support 
T = supp(0) is small. 



1.2 Related Work 



Consider the model ([T]) with constant variance, i.e., a{x,9) = (Jq. Most of the 
existing high-dimensional literature is focused on estimation of the mean pa- 
rameter (3 in this homoscedastic regression model. Under a variety of assump- 
tions and regularity conditions, any penalized estimation procedure mentioned 
in introduction can, in theory, select the correct sparse model with probability 
tendin g to 1. Literature on variance estimation is not as developed. iFan et al 



2012{ pr oposed a two step pro cedure for estimation of the unknown variance 
CTo, while ISun and Zhana [2011j proposed an estimation procedure that jointly 



estimates the model and the variance. 

Problem of estimation in the heteroscedastic linear regression models have 
been studied extensively in the classical setting with p fixed, however, the prob- 
lem of e stimation under the model ^ when p ^ n has not been adequately 
studied. IJia et al.l |2010j assume that a{x,9) = |x'/3| and show that Lasso is 
sign consistent for the mean parameter f3 under certain conditions. Their study 
shows limitations of lasso, for which many highly scalable solvers exist. How- 
ever, no new methodology is developed, as t he authors acknowledge th at the 
log-likelihood function is highly non-convex. iDette and Wageneii 201 1| study 
the adaptive lasso under the model in ([T]). Under certain regularity conditions, 
they show that the adaptive lasso is consistent, with suboptimal asymptotic 
variance. However, the weighted adaptive lasso is both consistent and achieves 
optimal asymptotic variance, under the assumption that the variance function is 
consistently estimated. However, they do not discuss how to obtain an estima- 
tor of the varian ce function i n a p rincipled way and resort to an ad-hoc fitting 
of the residuals. Dave et al. (201 ij develop HHR procedure that optimizes the 
penalized log-likelihood under ([T]) with the ^i-norm penalty on both the mean 
and variance parameters. As the objective is not convex, HHR estimates f3 
with 9 fixed and then estimates 9 with /3 fixed, until convergence. Since the 
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objective is biconvex, HHR converges to a stationary point. However, no theory 
is provided for the final estimates. 



2 Methodology 

In this paper, we propose HIPPO (Heteroscedastic Iterative Penahzed Pseudo- 
hkclihood Optimizer) for estimating /3 and 6 under model ((1]). 

In the first step, HIPPO finds the penalized pseudolikelihood maximizcr of 
f3 by solving the following objective 

$ = arg min ||y - X(3\\' + 2n ^ PA.(|ft-|), (2) 

je[p] 

where p^s is the penalty function and the tuning parameter A5 controls the 
sparsity of the solution 0. 

In the second step, HIPPO forms the penalized pseudolikelihood estimate 
for 6 by solving 

6 = arg min > x'0 + > rif exp(— x'0) 

+ 4n^PA^(|^,|) 

je[pl 

where 17 = y — is the vector of residuals. 

Finally, HIPPO computes the reweighted estimator of the mean by solving 



/3„ = arg min J] + 2n ^ p,,(|/3,|) (4) 

ie[ri] je[p] 

where Ui = cxp(x';0/2) are the weights. 

In classical literature, estimation under heteroscedastic models is achieved 
by employing a pseudolikelihood objective. The pseudolikelihood maximization 
principle prescribes the scientist to maximize a surrogate likelihood, i.e. one that 
is believed to be similar to the likelihood with the true unknown fixed variances 
(or means alternatively). In classical theory, central limit theorems are derived 
for many pseudo- maximum likel ihood (PML) estimators using generalized es- 
timating equations IZiegleil 2011 1. HIPPO fits neatly into the pseudolikelihood 



framework because the first step is a regularized PML where only the mean 
structure needs to be correctly specified. The second step and third steps may 
be similarly cast as PML estimators. Indeed, all our theoretical results are due 
to the fact that in each step we are optimizing a pseudolikelihood that is similar 
to the true unknown likelihoods (with alternating free parameters). Moreover, 
it is known that if the surrogate variances in the mean PML are more similar 
to the true variances then the resulting estimates will be more asymptotically 
efficient. With this in mind, we recommend a third reweighting procedure with 
the variance estimates from the second step. 
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Fan and l1 |2001| advocate usage of penalty functions that result in esti- 
mates satisfying three properties: unbiasedness, sparsity and continuity. A rea- 
sonable estimator should correctly identify the support of the true parameter 
with probability converging to one. Furthermore, on this support, the estimated 
coefHcients should have the same asymptotic distribution as if an estimator that 
knew the true support was used. Such an estimator satisfies the oracle property. 
A number of co ncave penalt i es res ult in estimates that satisfy this property: the 



SCAD penalty [Fan and Lil |2001|. the MCP p enalty [Zhangj . |201Q | and a class 
of folded concave penalties Lv and FanI |2009l |. For concreteness, we choose to 



use the SCAD penalty, which is defined by its derivative 



nt < A} + ^-^^^Ht > A} 
[a — 1)A 



(5) 



where often a — 3.7 is used. Note that estimates produced by the £i-norm 
penalty are biased, and hence this penalty does not achi eve oracle property . 

HIPPO is related to the iterative HHR algorithm of Dave et ahl 2011 1. In 
particular, the first two iterations of HHR are equivalent to HIPPO with the 
SCAD penalty replaced with the £i norm penalty. In practice, one can continue 
iterating between solving ^ and dU, however, establishing theoretical prop- 
erties for those iterates is a non-trivial task. From our numerical studies, we 
observe that HIPPO performs well when stopped after the first two iterations. 



2.1 Tuning Parameter Selection 

As described in the previous section, HIPPO requires selection of the tuning 
parameters As and At, which balance the complexity of the estimated model 
and the fit to data. A common approach is to form a grid of candidate values 
for the tuning parameters As and At and chose those that minimize the AIC 
or BIC criterion 

AIC(As, At) = ^(y*' ^' ^) + 24f, (6) 

ie[n] 

BIC(As,At)= ^ %„x,;/3,0)+dflogn (7) 

ie[n] 

where, up to constants, 

£{y, x; /3, 0) = x'0 + {y - ^'f3f exp(-x'0) 

is the negative log-likelihood and 

|supp(/3)| + |supp(0)| 

is the estimated degrees of freedom. In Section we compare performance of 
the AIC and the BIC for HIPPO in a simulation study. 
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2.2 Optimization Procedure 



In this section, we describe numerical procedures used to solve optimization 
problems in ([2), ([3]) and dH). Our procedures are based on the loca l linear 
approximation for the SCAD penalty developed in IZou and Li 2008 1, which 
gives: 



Pxm) « PA(|/3f I) +p'A(l/?f - l/3f I 

for /3, w /3 



(fc) 



This approximation allows us to substitute the SCAD penalty X]je[p] Pa(|/3j |) 
in @, ® and dH) with 



(fe) 



(8) 



and iteratively solve each objective until convergence of {/3('=)}fe. We set the 
initial estimates /3^°'' and ^f"-' to be the solutions of the £i-norm penalized 
problems. The convergence of these iterative approxima tions follows from the 
convergence of the MM (minorize-maximize) algorithms Zou and Li , 2008| . 
With the approximation of the SCAD penalty given in ([8]), we can solve 
and dH) using standa rd lasso solvers, e.g., we use the proximal method of 
Beck and Teboulld |2009l| . The objecti ve in @ is minimi zed using a coordinate 
descent algorithm, which is detailed in iDave et al.l |201l| . 



3 Theoretical Properties of HIPPO 

In this section, we present theoretical properties of HIPPO. In particular, we 
show that HIPPO achieves the oracle property for estimating the mean and 
variance under the model dU- AH the proofs are deferred to Appendix. 

We will analyze HIPPO under the following assumptions, which a re standard 
in th e literature on high-dimensional statistical learning [see, e.g. iFan et aL . 
2012f . 

Assumption 1. The matrix X = (xi, . . . , x„)' e R"^^ has independent rows 
that satisfy = S^/^z^ where {z^ji are i.i.d. subgaussian random variables with 
Ezi = 0, EzjZ^ = I and parameter K (see Appendix for more details on subgaus- 
sian random variables). Furthermore, there exist two constants Cmin, Cmax > 
such that 

< Cniin < An,in(S) < An,ax(S) < C^ax < OO. 

Assumption 2. The errors ei, . . . , e„ are i.i.d. subgaussian with zero mean 
and parameter 1. 

Assumption 3. There are two constants /3 and 6 such that ||/3|| < /3 < oo 
and ||0|| < ^ < oo. 

Assumption 4- \S\ = Csn"^ and |T| = Crn'^'^ for some as G (0, 1) and 
ar e (0, 1/3) and constants Cg, Ct > 0. 

The following assumption will be needed for showing the consistency of the 
weighted estimator in dH). 
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Assumption 5. Define 



Dss = 7i-iX'5diag(exp(-X0))Xs. 
There exist constants < i^min, £'max < oo sucli that 

hm P[A„,ax(D55) < i^max] = 1, and 

lim P[A„,in(D55) > ^min] - 1- 

n—^oo 

Furthermore, we have that 

lim IIID55-IED55III2 =op(l). 

n— ^00 

With these assumption, we state our first result, regarding the estimator (3 
in (HI). 

Theorem 1. Suppose that the assumptions (l)-(4) are satisfied. Furthermore, 



assume that \s > ci\J\og{p) exp(^/c2 log(n))/n, raiuj^^s] \ > As > C3Y/log(s) exp(^C2 log(n))/n 
and log(p) = ©(n"") for some G (0, 1). Then there is a strict local minimizer 
$ = {$'g,0'gcy of (HI) that satisfies 



||/3s-/3slU<c3^ ^^P^^^^^)^"g^^) (9) 

for some positive constants Ci,C2, and C3 and sufficiently large n. 

In addition, if we suppose that assumption (5) is satisfied, then for any fixed 
a S with ||a||2 = 1 the following weak convergence holds 

^a'(/3s-/35) AaA(0,1) (10) 

where = a'Sg^EDssSg^a. 

The first result stated in TheoremU] establis hed that (3 achieves the weak or 



acle property in the sense of ILv and FanI |2009| . The extra term exp(Vlogn) is 



subpolynomial in n and appears in the bound ([9]) due to the heteroscedastic na- 
ture of the errors. The seco nd result esta blishes the strong oracle property of the 



estimator /3 in the sense of iFan and Lvl ^2009j . that is, we establish the asymp- 



totic normality on the true support S. The asymptotic normality shows that 
(3s has the same asymptotic variance as the ordinary least squares (OLS) esti- 
mator on the true support. However, in the case of a heteroscedastic model the 
OLS estimator is dominated by the generalized least squares estimator. Later 
in this section, we will demonstrate that f3w has better asymptotic variance. 
Note that (3 correctly selects the mean model and estimates the parameters at 
the correct rate. From the upper and lower bounds on A5, we see how the rate 
at which p can grow and the minimum coefficient size are related. Larger the 
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ambient dimension p gets, larger the size of As, which lower bounds the size of 
the minimum coefficient. 

Our next result establishes correct model selection for the variance parameter 

d. 

Theorem 2. Suppose that assumptions (l)-(5) are satisfied. Suppose further 
that At > n"^""'^/^ log(p) log(n) and min^gj-r] l^jl ^ -^t- Then there is a strict 
local minimizer 6 = {6'rp,0'rpcY with the strong oracle property, 

||n(i-"^)/2(0-6/)|| =Op(l) 

Morover, for any fixed a G M* with ||a||2 = 1 the following weak convergence 
holds 

^a'(0T-0T) AaA(0,1) (11) 

where = a'S^^a. 

With the convergence result of 6 we can prove consistency and asymptotic 
normality of the weighted estimator /3 in . 

Theorem 3. Suppose that the assumptions (l)-(5) are satisfied and that there 
exists an estimator satisfying \ \6 — 0\\2 = 0{rn), for a seguence r„ — )■ and 

supp(0) = supp(0). Furthermore, assume that Xs > Ci\J\og{p) exp(-\/c2 log(n))/n, 
minjg[s] |/3j| > As > C3r„ exp(^c2 log(n)) log(7i) and log(p) = ©(n"") for 
some ao £ (0,1). Then there is a strict local minimizer j3yj — (/3^ 5,05c) of 
(21) that satisfies 

I Ww,s - l^sWoo < c^rn exp(\/c2 log(n)) log(n) (12) 

for some positive constants Ci,C2, and C3 and sufficiently large n. 

Furthermore, for any fixed a e R'' with ||a||2 = 1 the following weak conver- 
gence holds 

^a'(/3-/3)-^AA(0,l) (13) 

where (1 = a'(EDss)-ia. 

Theorem [3] establishes convergence of the weighted estimator 0w hi ^ 
and the model selection consistency. The rate of convergence depends on the 
rate of convergence of the variance estimator, r„. From Theorem [2l we show 
the parametric rate of convergence for 9s- The second result of Theorem [3] 
states that the weighted estimator f3w,s is asymptotically normal, with the same 
asymptotic variance as the generalized least squares estimator which knows the 
true model and variance function (t(x, 9). 
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||6l-6l||2 Preg Recg 

P = 

HHR-AIC 0.59(0.13) 0.4(0.17) 1.00(0.00) 

HIPPO-AIC 0.26(0.15) 0.6(0.22) 1.00(0.00) 

HHR-BIC 0.59(0.13) 0.39(0.16) 1.00(0.00) 

fflPPO-BIC 0.26(0.15) 0.59(0.22) 1.00(0.00) 

0.5 

HHR-AIC 0.32(0.12) 0.68(0.21) 1.00(0.00) 

HIPPO-AIC 0.38(0.22) 0.69(0.25) 1.00(0.03) 

HHR-BIC 0.32(0.12) 0.68(0.21) 1.00(0.00) 

HIPPO-BIC 0.38(0.22) 0.69(0.25) 0.99(0.03) 



Table 1: Mean (sd) performance of HHR and HIPPO under the model in 
Example 1 (averaged over 100 independent runs). The mean parameter /3 is 
assumed to be known. 



4 Monte-Carlo Simulations 

In this section, we conduct two small scale simulation studies to demonstrate 
finite sample perf ormance of HIPPO . We compare it to the HHR procedure 



Dave et al.l , 12011 



Convergence of the parameters is measured in the I2 norm, — and 
1 1^ — 0| I . We measure the identification of the support of (3 and 9 using precision 
and recall. Let S denote the estimated set of non-zero coefficients of 5, then the 
precision is calculated as Pre^ := |S'nS'|/|S'| and the recall as Rec^ := |S'nS'|/|S'|. 
Similarly, we can define precision and recall for the variance coefficients. We 
report results averaged over 100 independent runs. 



4.1 Example 1 

Assume that the data is generated iid from the following model Y = (T(X)e 
where e follows a standard normal distribution and the logarithm of the variance 
is given by 

l0gC7(X)2 =Xi+X2+X3. 

The covariates associated with the variance are jointly normal with equal corre- 
lation yO, and marginally A/'(0, 1). The remaining covariates, X4, . . . , Xp are iid 
random variables following the standard Normal distribution and are indepen- 
dent from (Xi, X2, X3). We set (n,p) = (200, 2000) and use p = and p = 0.5. 
Estimation procedures know that /3 = and we only estimate the variance pa- 
rameter 6. This example is provided to illustrate performance of the penalized 
pseudolikelihood estimators in an idealized situation. When the mean parame- 
ter needs to be estimated as well, we expect the performance of the procedures 
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#it 


ll/3-/3||2 


Prefl 


Recfl 




Pree 


Recg 










n 


= 200 






HHR-AIC 


1st 


0.78(0.52) 


0.44(0.22) 


1.00(0.00) 


2.10(0.11) 


0.25(0.10) 


0.54(0.16) 




2nd 


0.31(0.13) 


0.88(0.15) 


1.00(0.00) 


1.80(0.16) 


0.29(0.07) 


0.71(0.14) 


HIPPO- AIC 


1st 


0.66(0.84) 


0.75(0.29) 


1.00(0.02) 


2.00(0.16) 


0.20(0.10) 


0.52(0.16) 




2nd 


0.08(0.07) 


0.84(0.24) 


1.00(0.00) 


1.50(0.30) 


0.30(0.11) 


0.75(0.12) 


HHR-BIC 


1st 


0.77(0.48) 


0.58(0.17) 


1.00(0.00) 


2.10(0.10) 


0.41(0.18) 


0.45(0.14) 




2nd 


0.31(0.13) 


0.89(0.13) 


1.00(0.00) 


1.90(0.16) 


0.38(0.15) 


0.65(0.17) 


HIPPO-BIC 


1st 


0.70(0.83) 


0.80(0.25) 


0.99(0.03) 


2.00(0.14) 


0.39(0.18) 


0.50(0.17) 




2nd 


0.08(0.06) 


0.97(0.07) 


1.00(0.00) 


1.60(0.28) 


0.44(0.16) 


0.72(0.14) 


HHR-AIC 


1st 


0.59(0.37) 


0.58(0.26) 


n 

1.00(0.00) 


= 400 

1.90(0.11) 


0.36(0.14) 


0.72(0.18) 




2nd 


0.30(0.24) 


0.98(0.06) 


1.00(0.00) 


1.70(0.16) 


0.43(0.13) 


0.81(0.16) 


HIPPO- AIC 


1st 


0.44(0.54) 


0.87(0.22) 


1.00(0.00) 


1.80(0.18) 


0.28(0.10) 


0.67(0.15) 




2nd 


0.06(0.29) 


0.97(0.12) 


1.00(0.02) 


1.00(0.31) 


0.56(0.18) 


0.93(0.09) 


HHR-BIC 


1st 


0.59(0.37) 


0.66(0.20) 


1.00(0.00) 


1.90(0.11) 


0.46(0.18) 


0.66(0.20) 




2nd 


0.30(0.23) 


0.98(0.06) 


1.00(0.00) 


1.70(0.17) 


0.46(0.13) 


0.80(0.17) 


HIPPO-BIC 


1st 


0.46(0.58) 


0.89(0.19) 


1.00(0.01) 


1.80(0.18) 


0.39(0.17) 


0.65(0.17) 




2nd 


0.06(0.29) 


0.99(0.06) 


1.00(0.02) 


1.00(0.31) 


0.63(0.20) 


0.92(0.09) 



Table 2: Mean (sd) performance of HHR and HIPPO under the model in 
Example 2 (averaged over 100 independent runs). We report estimated models 
after the first and second iteration. 



only to get worse. Since the mean is known, both HHR and HIPPO only solve 
the optimization procedure in HHR with the £i-norm penalty and HIPPO 
with the SCAD penalty, without iterating between ^ and 

Table [T] summarizes the results. Under this toy model, we observe that 
HIPPO performs better than HHR when the correlation between the relevant 
predictors is p = 0. However, we do not observe the difference between the two 
procedures when p = 0.5. The difference between the AIC and BIC is already 
visible in this example when p = Q. The AIC tends to pick more complex models, 
while the BIC is more conservative and selects a model with fewer variables. 



4.2 Example 2 

The following non-trivial model is borrowed from Dave et al. (20111]. The re- 
sponse variable Y satisfies 

i6[p] ie[p] 
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with p = 600, = 2,00^ 1, 



/3[i2] =(3,3,3,1.5,1.5,1.5,0,0,0,2,2,2)', 

0[i5] = (1, 1, 1, 0, 0, 0, 0.5, 0.5, 0.5, 0, 0, 0, 0.75, 0.75, 0.75)', 

and the remainder of the coefficients are 0. The covariates are jointly Normal 
with cov{X„Xj) = 0.5l*--'l and the error e follows the standard Normal distri- 
bution. This is a more realistic model than the one described in the previous 
example. We set p — 600 and the number of samples n = 200 and n = 400. 

Table [5] summarizes results of the simulation. We observe that HIPPO con- 
sistently outperforms HHR in all scenarios. Again, a general observation is that 
the AIC selects more complex models although the difference is less pronounced 
when the sample size n = 400. Furthermore, we note that the estimation error 
significantly reduces after the first iteration, which demonstrates final sample 
benefits of estimating the variance. Recall that Theorem [T] proves that the esti- 
mate (3 consistently estimates the true parameter (3. However, it is important 
to estimate the variance parameter 6 well, both in theory (see Theorem [3]) and 
practice. 

5 Real Data Application 

Forecasting the gross domestic product (GDP) of a country based on macroeco- 
nomic indicators is of significant interest to the economic community. We obtain 
both the country GDP figures (specifically we use the GDP per capita using cur- 
rent prices in units of a 'national currency') and macroeconomic variables from 
the International Monetary Fund's World Economic Outlook (WEO) database. 
The WEO database contains records for macroeconomic variables from 1980 to 
2016 (with forecasts). 

To form our response variable, yi^t, we form log- returns of the GDP for each 
country (i) for each time point (t) after records began and before the forecast- 
ing commenced (each country had a different year at which forecasting began). 
After removing missing values, we obtained 31 variables that can be grouped 
into a few broad categories: balance of payments, government finance and debt, 
inflation, and demographics. We apply various transformations, including lag- 
ging and logarithms forming the vectors x^.t. We fit the heteroscedastic AR(1) 
model with HIPPO. 

yi.t = x^,t_i/3 + exp(x^^t„i0)ei,t 

In order to initially assess the heteroscedasticity of the data, we form the 
LASSO estimator with the LARS package in R selecting with BIG. It is common 
practice when diagnosing heteroscedasticity to plot the studentized residuals 
against the fitted values. We bin the bulk of the samples into three groups by 
fitted values, and observe the box-plot of each bin by residuals (Figure [2]). It 
is apparent that there is a difference of variances between these bins, which is 
corroborated by performing a F-test of equal variances across the second and 
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HIPPO 



HHR 



MSE 
%,X;/3,0) 



0.0089 
0.4953 

5.4 

8.2 



0.0091 
0.6783 

8.9 

5.1 



1^1 
|f| 



Table 3: Performance of HIPPO and HHR on WEO data averaged over 10 
folds. 

third bins (p- value of 4 x 10~^). We further observe differences of variance 
between country GDP log returns. We analyzed the distribution of responses 
separated by countries: Canada, Finland, Greece and Italy. The p-value from 
the F-test for equality of variances between the countries Canada and Greece is 
0.008, which is below even the pairwise Bonferroni correction of 0.0083 at 0.05 
significance level. This demonstrates heteroscedasticity in the WEO dataset, 
and we are justified in fitting non-constant variance. 

We compare the results from HIPPO and HHR when applied to the WEO 
data set. The tuning parameters were selected with BIG over a grid for A5 and 
At- The metrics used to compare the algorithms are mean square error (MSE) 
defined by ^ J2iiyi:t~yi:t)^ ^ ^^e partial prediction score defined as the average of 
the negative log likelihoods, and the number of selected mean parameters and 
variance parameters. We perform 10-fold cross validation to obtain unbiased 
estimates of these metrics. In Table |3] we observe that HIPPO outperforms 
HHR in terms of MSE and partial prediction score. 

6 Discussion 

Wc have addressed the problem of statistical inference in high-dimensional lin- 
ear regression models with heteroscedastic errors. Heteroscedastic errors arise 
in many applications and industrial settings, including biostatistics, finance and 
quality control in manufacturing. We have proposed HIPPO for model selec- 
tion and estimation of both the mean and variance parameters under a het- 
eroscedastic model. HIPPO can be deployed naturally into an existing data 
analysis work-flow. Specifically, as a first step, a statistician performs penal- 
ized estimation of the mean parameters and then, as a second step, tests for 
heteroscedasticity by running the second step of HIPPO. If heteroscedastic- 
ity is discovered, HIPPO can then be used to solve penalized generalized least 
squares objective. Furthermore, HIPPO is well motivated from the penalized 
pseudolikelihood maximization perspective and achieves the oracle property in 
high-dimensional problems. 

Throughout the paper, we focus on a specific parametric form of the variance 
function for simplicity of presentation. Our method can be extended to any 
parametric form, however, the assumptions will become more cumbersome and 
the particular numerical procedure would change. It is of interest to develop 
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general unified framework for estimation of arbitrary parametric form of the 
variance function. Another open research direction includes non-parametric 
estimation of the variance functio n in high-dimensions, w hich could be achieved 



with sparse additive models [see Ravikumar et al. . 2009l |. 
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7 Appendix 



In the appendix, we collect some well known results and provide proofs for the 

results in the main text. 

For readers convenience, we summarize the notation again. We use [n] to 
denote the set {l,...,n}. For any index set S C [p], we denote f3s to be 
the subvector containing the components of the vector j3 indexed by the set 
iS, and Xs denotes the submatrix containing tlic columns of X indexed by S. 
For a vector a e K", we denote supp(a) = {j : aj ^ 0} the support set, 
||a||q, q € (0,00), the i'^-norm defined as ||a||g = (X]je[n] '^iY^'^ with the usual 
extensions for q G {0,oo}, that is, ||a||o = |supp(a)| and ||a||oo = maxj£[„] |aj|. 
For notational simplicity, we denote || • || = || • II2 the £2 norm. The unit sphere 
of ^2 = (K", II • II) is denoted by S"-~^. The canonical bases of £3 we denote by 
ei,...,e„. For a matrix A e W^^p we denote |||A|||2 = sup{||A2;|| : ||a;|| = 1} 

the operator norm and ||A||i? = y^X]je[n] Sje[p] ^ij Probenius norm. For a 
symmetric matrix A e M"^", we use Amin(A) to denote Ai„ax(A) the smallest 
and largest eigenvalue respectively. 



7.1 Subgaussian random variables 

In this section, we define subgaussian random variables and state a few well 
known properties. 

We denote ||X||ip the Lp norm of a random variable X, i.e., ||^||lp = 
{EE\X\Py/P. Define the Orlitz function ^^{x) = exp(|a;|") - 1, a > 1. 
Using the Orlitz function, we can define the Orlitz space of real valued random 
variables, L^^ , equipped with the norm 

\\X\\^^ = inf{c > : Eexp(|X/c|") < 2}. 

We will focus on the particular choice of a = 2. Define B^^Ct) the set of 
real-valued symmetric random variables satisfying 

1 < ||X||l, and ||a;||*, <7. (14) 

For X e (7) we have a good control of the tail probability 

P[X >t]< exp(-iV72), (15) 

which can be obtained using the Markov inequality 

2F[X > t] < n\X\ > t] < ^l^llf^/^^^^ < 2exp(-tVy) 

since X is symmetric and Eexp(X^/7^) < 2. 

The space Lq,.^ is the set of subgaussian random variables. A real-valued 
random variable X is called subgaussian with parameter p, 1/ > 0, ii 

Eexp(iX) < exp{p^t^/2), for all t > 0. (16) 
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It follows from this bound on the moment generating function that the following 
bound on the tail probability holds 

P[X >t]< exp{-t^/{2i'^)) for any t > 0. (17) 

We also have that X e B^^{ij) is subgaussian with parameter \/2^ by direct 
calculation. 

The following few facts are useful. 

Lemma 4. Let 7i > 1 and Xi G B<i,^{'-^i), i — l,...,n, be independent vari- 
ables, then for any oi, . . . , a„ € Sig[n] ^-i-^i ^■^ subgaussian with parameter 

2E^G[n]7^^a|. 

Proof. For any t > 0, we have 

Eexp(t J2 «'^») = n IEexp(ia,X,) < J| eMt'^a^^lf) = exp(i2 ^ a2^2^. 

i€[n] J'e[n] iS[ra] iS[n] 

The claim follows from (ITBl). □ 



Lemma 5. Lef 7 > 1 and Xi G i?*^ (7), « = 1, . . . , n, be independent variables, 
then for any u >0, 

^ ^ exp(n(log(2) - (u/jf)). (18) 

ieln] 

Proof. Using Markov inequality we have, 

P[X1 - "^"] - exp{-j-'^u^n)Eexp{j-^ ^ Xf) < 2"exp(-7" 



2 2 \ 
U n] 



which concludes the proof. □ 

The notion of subgaussian random variable can be easily extended to vector 
random variables. Let Z £ Rp be random variable satisfying EZ = 0, EZZ' = 
Ip. The random variable Z is subgaussian with parameter v if it satisfies 

sup \\{'Zi,^N)\\^,^ <v. (19) 

Let X = where S G positive definite matrix and S-^/^ ^j^g symmet- 

ric matrix square root. The following result is standard in multivariate statistics 
AndersonLbooaj . 



Lemma 6. Let S C [p] and j ^[p], j ^ S. Then 

X,^{Xs,{^ss)-'^Sj)+E, (20) 
and Ej is uncorrelated with X5. 
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The following lemma shows that Ej is subgaussian. 

Lemma 7. The random variable Ej defined in (j20p is subgaussian with param- 
eter Ky/2J:j\s, where T^j^g = - 'Ejs{'^ss)~^'^Sj ■ 

Proof. From the definition of X we have that Xj = Sj-^/^Z and = S^'l'^Z. 
With this, we have 

= Z'(S.y'-S.s(Sss)-'Ss,) 

and 

\\E,\\^, < llS.f -S^/'(Sss)-'Ssj||2||Z||*, 



This concludes the proof. □ 

Next, we present a few results on spectral norms of random matrices obtained 
as sums of random subgaussian vectors outer products. 

Lemma 8 ( Hsu et all 2011 1). Let Zi, . . . , z„ G be i.i.d random subgaussian 
vectors with parameter v, then for all 5 G (0, 1), 

P[||Ki^z,,z^-I|||2>2e(n,<5)]<<5 (21) 

ie[n] 

where 



e{n d) = u^( / 8(plog(9)+log(2/'5)) ^ plog(9) + log(2/^) 



The above result can easily be extended to variables with arbitrary covari- 
ance matrix. 

Lemma 9. Let xi,...,x„Rp be independent random vectors satisfying x^ = 
'S^^'^Zi with Zi, . . . , z„ being independent subgaussian vectors with parameter v 
and S"'^/^ is the symmetric matrix square root of S. // Amax(5^) < oo and 
Aniin(S) > 0, then for all S G (0, 1) 

P[|||n-i ^^^^ - > 2A„,ax(S)e(n, S)] < 6 (22) 

ie[n] 

and 

P[|||(n-i J2 ^■'^'^)"' - i^y'h > 2e(n,5)/A^i„(S)] < S. (23) 

ie[n] 
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Proof. We have that 

|||n-iX'X - SIII2 = |||Si/2(„-iz'Z - Ip)Si/2|||2 < A^ax(S)|||n-iZ'Z - 



and ^ fohows from (|2T|) . 
Similarly, we can write 

|||(n-iX'X)-i - = |||S-i/2((n-iZ'Z)-i - 

<A-J„(S)|||(n-iZ'Z)-i-I,|||2 

and (1131) follows from (El]). □ 

7.2 Proofs and Technical Results 

For convenience, we restate technical conditions used in the paper. 

Assumption 1. The matrix X = (xi, . . . , x„)' g R"^^ has independent rows 
that satisfy x^ — S^/^Zj where {z^ji are i.i.d. subgaussian random variables with 
Ezj = 0, Ez^z^ = I and HziHip^ < K. Furthermore, there exist two constants 

Cmin, C'max > SUCh that 

< Cmin < An,in(S) < Amax(S) < C^ax < OO. 

Assumption 2. The errors ei, . . . , e„ are i.i.d. with € B^^,^ (1). 
Assumption 3. There are two constants P and 6 such that ||/3|| < /3 < C)o 
and ||0|| < ^ < oo. 

Assumption 4. \S\ = Csn"^ and \T\ = Cxn"^ for some as e (0, 1) and 
ar G (0, 1/3) and constants Cs, Ct > 0. 
Assumption 5. Define 

T>ss = n-iX^diag(exp(-X0))Xs. 

There exist constants < i^min, Dmax < 00 such that 

lim P[A„,ax(Dss) < -Dmax] = 1, and 

lim P[A„,i„(Dss) > ^min] = 1. 

7.3 Proof of Theorem [1] 

We will split the proof in two parts. In the first part, we show that the vector 
/3 — {0'g,O'gcy, where 0s = (^s^sj^^X'^y, is a strict local minimizer of ([2]) 
and S = {j : Pj ^ 0}. In the second part, we use results for pseudo- maximum 
likelihood estimates to establish asymptot ic normality of 0s- 

From Theorem 1 in iFan and Lvl (2009| . we need to show that /3 satisfies 



X's(y - X^) - nsign(/3s) O p'x{0s) = 0, (24) 
||X'sc(y-X/3)|U <np^(0+), (25) 
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and 



A„,i„(n-iX'sXs) > max (26) 



in order to show that /3 is a strict local minimizer. 
Define the events 

Ai = (maxexp(|x',0|) < exp ( J K^,^{llTT)\\e\\l\og{2n / 5) 
where T ^ {j : 9, Q} and 

A2 = {A„,ax((n-^X'5Xs)-l) < 3/A„in(Sss)}. 

To simplify notation, we define 



a' = exp (^^X2A,„,,(STT)||0|lilog(2n/,5)j . (27) 

The following two lemma shows that the events Ai and A2 occur with high 
probability. 

Lemma 10. Under the assumptions of Theorem\^ we have that P[^i] > 1 — 5. 
Proof. We have 



Lemma follows by setting t = -^/X^AmaxCSTT)!!^!!! log(2n/(5) in dTT]) and using 
the union bound. □ 

Lemma 11. Suppose that the assumptions of Theorem]^ are satisfied. Further- 
more, assume that n is big enough so that 



^2/ / 8(a7i"Mog(9)+log(2/J)) ^ an"Mog(9) + log(2/^) \ 



Then P[A2] >l-5. 
Proof. We have that 

with probability 1 — 6 using and the fact that n is large enough so that 
e{n,S)<l. □ 

Recall that /3s is an ordinary least squares estimator using variables in S, 
so that 

$S-f3s^ {X'sXsr'X'sV - (X'5X5)-iX^diag(e^^/2)e. 
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Define 

M := (X'sX5)-iX^diag(e^^)Xs(X^Xs)-^ 

Conditioned on X, using Lemma H] we liave that e'j{0s — /3s) is subgaussian 
with parameter y^2mjj, j & S. Therefore 

n\\0s - /3sl|oo > i I X] < 2sexp (-- ) . (28) 

On the event Ai Ci A2, 

3a2 



maxm^j < n V^A^axll" ^X'^Xs) ^) < . ,^ s ■ 

jeS An-iin[2^SS)n 



Setting t — •\/2(maxjgs Wjj) log(2s/5) in ([28|) and conditioning on the event 
Ai n A2 and its complement, we have 



y Ainin(2^SS)"- 

where C — if^Ainax(STT)| 1^1 12 with probabihty 1 — 3(5. Under the assumptions, 
we have that \\$s - /3s||oo < A. 

Using the result obtained above, we have that 

minl^jl >min 1/3,1 -||/3s-/35||oo 

>Pmin-\\$S-ps\\oo 
> Pn,in/2 > A, 

since /3,nin > n^'^'logn with 7 e (0,1/2]. This gives us that p'{(3s) — and 
maxjes{-/?A(l/3il)} = showing ^ and ([Mil- 
Using Lemma |6] and Lemma 17.11 we write X, e R" as Xj = X5T5 + Ej, 
J e S^' , with Ej having elements that are subgaussian with parameter K^ySj^g. 
Therefore 

n-ix;.(y-X/3) = (XsTs+E,)'(I-Ps)y = n-iE;.(I-Ps)diag(exp(X0/2))e. 
Denote 

Tij = n^^||Ej||2 maxexp(x,^0) 

and observe that nj > n"^||Ej(I - P5')diag(exp(X0/2))| I2. Condition on X, 
then for any j E S'~^ , 

P[|n-iE;.(I- Ps)diag(exp(X6>/2))e| > t] < 2exp{-t^ /n^). (30) 

Using the union bound together with Lemma [5l 

max ||Ej||2 < 3X(maxEj|5)n 
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with probability at least 1 — {p — s)exp(— 2n). Conditioning on the event Ai 
and its complement 

maxnw < 3K(niax'E,i\a)n~^ exr)(\/C\o3:(2n/S)) 
where C = i4r^Aijiax(STT)||^||2 with probability 1 — 5— (p— s) exp(--2n). Picking 



in (1301) and combining with the above, we have shown that 



||n-iX'sc(y-X/3)||oo < /(maxn,)log(2(p-s)/<5) 

with probability 1 — 2(5 — (p — s) cxp(— 2n). Since 

/3i^(maxS,|s)n-i eMVcT^S^^) log(2(p - s)/d) < A/2 < ^^0+) 

we have shown that /3 is a strict local minimizer. This finishes the proof of the 
first part. 

We are now ready to show asymptotic normality of (3. Define W = diag(exp(— X0 
From the proof of the first part and the assumption (5), we have that 

0-/3^ (X'sXs)-iX'We 

= n-i/2(Ss5)-i(ED55)^/'e + 0p(l), 

where the small order term is understood under the L2 norm. Write 

a'(/3-/3)= ^c,e, 

ie[n] 

where q = n-^/'^a'{^ssy^{^'Dss)\^'^ ■ It follows that 

Var(c,e,) = n'^a' {•Essr^EBssi'Sssr'a, 

and 

ie[n] iG[n] 

<Cn-3/2||a'(Ss5)-^||^Ell(™S5)-V'll' 

ie[n] 

= 0(1). 

This allows us to apply the Lyapunov's theorem to conclude the proof of the 
theorem. 
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7.4 Proof of Theorem [2] 

Consider an oracle that performs the second stage of HIPPO with full knowl- 
edge of the sparsity set T, resulting in the estimator 0^, with Xt = 0. (viz. 6x 
is the pseudo-likelihood maximizer by forming the likelihood with the estimated 
residuals from the OLS.) Then 9 = (0^,O^c)' is a strict local minimizer of the 
program © for At ^ log(p). We derive necessary and sufficient condi- 

tions for /3 to be a str i ct loc al minimizer of the program ([2]) akin to those in Theo- 
rem 1 in lFan and Lvl j2009l |. We show that the PML is asympt otically equivalent 
to the maximum likelihood estimator through a lem ma from|H iort and. PoUardl 



1993] and by following arguments similar to i Jobson and Fuller 1980j. As op- 



posed to the classical asymptotic theory, we implicitely construct finite sample 
results because |T| is allowed to grow with n. 

Lemma 12. Let M = X(X'X)-iX' then, 

max||M,||=Op (^-^^^L^ log(n) 
Proof. It is known bv lHsu et al. 2011 1 that with probability at least 1 — 5, 



1 '"^ / 
|||-5]x:^sX,.s-Ss.s||| =Op lllEs.; 



s + \og{l/5) 



Then we have that 



n||M,|| = ||x,,5(^^)-^Xj|| < ||x,,sS5,^sX5ll + ||x,,5((^^)-^-S^i)X 
Controlling first the rightmost term, 



|x.,,s(( ) -^s,s)-^s\\^ 



by a result we obtain below, Amin(^^^p^) and Ainin(I]5^5) are bounded below 
by a constant with high probability. By the result above. 



furthermore we have that IIX5II = Op{y/s\og{l/S)). Hence the second term is 

Op((slog(l/5))3/2/V^). 

Now consider the first term, ||xi.5l]^"^Xj||. And write Uj = S^^Xj^s then 



we have that 



|x,,sI]^i^Xjf < + I J2 U7U,\ < \m' + |lC/.||V2nlog(2/<5) 
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because 



Y^UjU^ <\\Ui\W'^n\og{2/5) 

with probability 1 — (5 by the sub-Gaussianity of {Uj}. Thus, ||xi^sE^"^Xj|| = 
s\og(l/5) + ■v/2snlog(2/5)). So, assuming that s « n then 



□ 



Lemma 13. Consider both the empirical residuals, rj, and the true residuals, 
T}. Then 

max|^,2-r;,^|=op(^^ 
for any 7 e [0, (1 - as)/2). 

Proof. First let us expand the terms in question: 

fi - ril = [{I - yi-Hl - ril = [ri- Mr;]? - tjI = (M,?)^ - 2r;,(Mry)? 
Now we take a closer look at the right hand side, 

[n] [n] 

(Mr?)2 - 2r;,(M,7)2 = (^M,,,r,,)2 - 2%^M,,,r,, 

i j 

Notice that the true residuals rj are IID sub-Gaussian with parameter at most 
a. Hence, with probability 1 — S 

[n] 

\Y,MijVj\<m\\^V^^og{2/6) 

i 

Hence, we find that 

\{Mr,f, - 2r?,(M,?)2| = Oi\\M,\\aHogl/6) 
Below we show that there is a constant C > such that 

= 0(exp(VC||^||2log(2n/^))) 
with probability at least 1 — S. Hence, 

i/4(log(n/(5)) 



max \n! - vf\ = O C ''If r'' eM^/C\\9\\Hog{2n|5)) 
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with probability 1 — S. Because s ^ n" we have our result then this reduces to, 
niaxlryf - rjf\ ^ Ql ^^^firff^^ expi^C\\e\\^ \og{2n/S))) 



Because both log(n/5))^ and exp{^yC\\9\\'^\og{2n/S)) are subpolynomial in n, 
so, 

max \f]f - Tj^\ = Ori ) 

where 7 > may be arbitrarily close to but less than {1 — as) /'i.. □ 

Lemma 14. Consider the difference of the pseudo-likelihood gradient and the 
true likelihood gradient, 



U„ - U„ = ^ 

i 

If ar < (1 - Q;s)/2 then 

|1U„-U„|1 -op(l) 

Proof. By the above lemma, 

max I V^(77f - ril) \ = op(4-) 

Moreover, we know that maxjg[„] e"^'^ Xi — 0{-\/t4>{n)) where (/'(n) is sub- 
polynomial in n. Because \/t = and ar < 2j then 



max 

ie[n] 



Thus the average of the summands is op{l) and we have our result. □ 

Lemma 15. Consider the difference of the pseudo-likelihood Hessian and the 
true Hessian, 

-, [»1 

V„ - V„ = - Y,{f,f - ,7|)e-<«-x,x: 

i 

If dT < (3 — as)/4 then 

|||V„-V„||| =op(l) 

Proof. This proof follows in the same way as Lemma mutatis mutandis. □ 

Lemma 16. Let 6t be the pseudo likelihood maximizer and \T\ = n"^ with 
aT € (0, -i). And define the gradient and hessian, 

1 " 

tin = ^ V x,,T - fjfe-^-TeT 
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n 

n ^ — ' 

1=1 

then we have that, 

W^iOT - Ot) -'V-^t.^W ^ or{l) 

Proof. Denote the function I/„(6't) = X]i=i ^(^r + v^l-"-*' ''?) "^^lich has mini- 
mizer ^/n(BT — Qt)- Consider the Taylor expansion for L„ about 0, 

for some r g [0,1]. In Lemmata lTil [TSl we devoted ourselves to understanding the 
first and second order terms in this expansion. We now must concern ourselves 
with the remainder, third order term. 



_1 ["1 

R^(e^)^'Y.^}e-<\^fiTf 
nJn ^-^ 

^ i 

_i ["1 



(31) 



Define 0J = V^^Ur, and A„((5) 

— sup|gj^_gt^l^^ i?„(r0x). We control by 
decomposing it as Rn — {Rn — Rn) + Rn- 

, [n] 

\\Rn{eT)-Rr.{eT)\\ < ^ ^ | r},^ " 7y,f |e-<-^- 1 |x,,T 1 1 1 1 1 1' 

' , - (32) 

' a"^ max,gr„i ||xj,T|P||e'||^\ 



Of 



,1+7 



We show in Lemma [10] that a is sub-polynomial in n. Moreover, by the sub- 
Gaussianity of Xi_T, niaxjg[„] ||xi_T||'^ = Op(|rp/^) modulo logarithmic terms. If 
we assume that 8 < \ then < 8||V~^U„|p. We have shown in the previous 
lemmata that ||V^^U„ — V^^U„|| — op(l) under our assumptions. By standard 
maximum likelihood estimation analysis ||V;;iU„|| = Ov{^/\r\) = Op(n"^/2). 
Hence, the RHS of eq. ([5^ is of the order Op(ri'^"^^^^''') modulo sub-polynomial 
terms. Due to the assumption that ay < 1/3 we have that ||i?„ — i?„|| — op(l). 
Notice that this convergence is uniform over Qt because of the specific form of 
eq. [32 

What remains to be shown is that R^iBT) = op(l) uniformly over \Q't — 
d^p\ < 5 for some small 5. By identical arguments to those above we see that 
||g-Xi^6»T(--5j.^ ^gi-)3|| _ Op(||xi_T||?^^"^^^) uniformly over i modulo logarithmic 
terms. So the function fi{i]i,Xi^TW) — Vi^^^'"^^^ i^i.T^)^ be bounded 

uniformly over a domain of radius Op(n^"^/^). fi{rii,Xi^TW) < Ihi^Xi^^f^'^"^^^!! 
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and Ery^llxi^Tll'^'T''^"^^^ — Op{a'^n^"'^). Finally, we may invoke the uniform law 
of large numbers yielding ^j-i=i?„(0'r) — > if ar < 1/3 as is sub-polynomial 
in n. 

We have shown that A{6) if (5 is decreasing. By Lemma 2 in lHiort and Pollard 



1993[, we know that if Amin(Ki) is bounded below in probability and A{S) -> 



uniformly then the minimizer of L„ converges in probability to the minimizer of 
its quadratic approximation, viz. \\y/ri{6T — Or) — V~^U„|| = op(l). We have 
established the latter condition while the former is a direct result of Lemma 

M □ 

Now, we can form the decomposition, 

< ||V-iU„ - V-iU„|| + ||V-iU„ - V-iU„|| + op(l) 

< |||v,;i|||||u„ -u„|| + iiiv-i - v-i|||||u„|| + op(i) 

(33) 

All of these terms are decaying in probability because, |||V^^ — < |||V„ — 

"^"11 lll"^n ^ III lll'^n III • Combining these results we find that 

W^IBt ~ Ot) - V-iU„|| = op(l) (34) 

We are now in a position to establish the oracle property of 



||n— (6/t-0t)|| =Op(l) 

Specifically, by standard MLE analysis we know that for a fixed coordinate 
in J e T, (V-iU„), = Op(l). Thus, ||V^(0t - eT)\\ < WV^ier - Ot) -~ 
V-^Unll + ||V,7iU„|| = Op(n"^/2). Finally, we establish that the estimator 

9t = {Sttj Ogc)' is a strict local minimizer of the program Q. 

The first order conditions for a local optima of the SCAD penalized likelihood 
of a are, 

["] 

J2a ~ vy^'H^ - Va„ m = o, if ^ o (35) 



I 5^(1 - r)|e--^^)xjj < V^„(0+), if = (36) 
3 

where p'\^{a) = sign(a)/3'^^ (a). The second order condition is given by, 

A^in (^ixidiag{exp(-x,(?i)}X7^ > (37) 

where ka„(6') = maxjeT{-PA„(|6'il)}- 

By the previous findings we have that uniformly over j, \0j\ > \6j\ ^ 
Op{n r^) so if At = i^{n 2^) and 6j > CXt for a C > specific to p, 
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eq. ([55]) follows. Moreover, by assumption 5, and similar arguments eq. p7l) 
holds. What remains to be shown is eq. ((36)) . Recall the decomposition of in 
Lemma [T] 

["] [»] 

i i 
[n] [n] [n] 

= II E(l - '?^^^-')EdU < II - Vf)e-<'E,\\^ + II E(l - ^^"^^')Edl 

i i i 

(38) 

We will soon show that e~'''*^^~^^ = op(l) and know by Lemma [TUl that e~''»^ 
is Op(0(n)) where 0(n) is subpolynomial in n. Hence, 



\ 



t I (39) 



|2 
-'*lloo 



= ^OA^)Or[n\og\TC\) = Op(ni/2-7^^(,,)iog(p)) 

This follows from the fact that Ef ^ is sub-exponential, hence, max^ ^ = 
0(log |T'-^|). We now show that e"^''""^-* — op(l) from the subgaussianity of 
||xi^r||. Specifically, because we know that 6 = Op(n("^^^^/^) we will prove 
this uniformly over a neighborhood of radius 0(n'^"^~^-'/^). Notice that over 
this neighborhood, e^^^'^^^^^ < exp(r||x,;.T||/fT-'^~"^''^^) for some r > 0. 

P{r||x,,T|| > < Ce-"" ^ P{r||x,,T|| > i^n°'^/^} < Ce-""^ 

^ n > ^ Cexp(-cKnV--)) (40) 



In a similar way we may bound the square of exp( ^l^^a.^)/2 ) in probability. Now 
in a uniform bounding technique similar to eq. (jSSf we can show that the second 
term in the RHS of eq. ^ is Op{n°'-^-'^/'^ \og{p)). Define the variables = 
g-^i,Ti^-s) — I and b„ = o(ri^/^~"^). We see that under the same neighborhood 
for 9 6, 

m^ >^}< C[il + ^)--''"]«^'^-°Vfc. = 0(1) 

co„ c6„ 
Hence, = op(n"^^^/^ log(rt)) and we may bound the remaining term. 

I Y^il - vl^<')Eh,\ = 1 1:(1 - 6^e--(^-«))£;,,| 

i i 
[n] [n] 

< I E(l - 4)E^A + I E ^?^'^^^-| ^ + Op(n"-+i/2 log(p)) 



(41) 
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by the central limit theorem and Cauchy-Schwartz. Similarly, we can bound 
this uniformly over j and obtain an additional log(p) factor. Hence, At 
fi-'^/'^+o'T log(p) log(n) for eq. ([55)) to hold. Thus 6 is the strict local minimizer 
of the SCAD penalized program. 

The weak central limit theorem in (jlip is a direct application of the standard 

CLT to (p4|. Specifically, a'V„U„ A/'(0,C) because the Fisher information 
of the true likelihood with respect to 9 is St- While a'(-\/n(0— 0)— V„U„) — > 
by p4p and the construction of 9. 

7.5 Proof of Theorem [3] 

The proof follows the same lines as the proof of Theorem [U however, there 
are some technical challenges that arise from having only an estimate of the 
variance. 

We define W = diag(exp(-X0/2)) and W = diag(exp(-X6>/2)). Further- 
more, we win use Dss = n^^X'^W^Xs and T>ss = n-^X'^W^Xs. 

We proceed to show that /3„, = {f3'^ S'^sc)' is a strict local minimizer of 
([U where f3'^ g — n^^'Dgg'X.'g'W^y, by showing that /3,„ satisfies 

X'sW(Wy - WX/3) - nsign(^^,5) Px{0^u,s) = 0, (42) 
||X^cW(Wy- WX/3^)|U < Va(0+), (43) 

and 

A,ni„(7i-^X'sW2Xs) > max {-pHWu.J)}- (44) 
Recall the definition of the event Ai from the proof of Theorem [U 

Ai = (maxexp(|x^0|) < exp ( J 2 A,„,, (Stt ) | | | | i log(277y,5) 

L le [n] \ V 

with ¥[Ai] > 1 - (5. Also, recaU that 

a' = exp (^^X2A„,ax(STT)||e||llog(2n/<5) 

Next, we define the event 

^3 = {max |x^(0 ~9)\<K\\9- 0||2Ai/f^(STT)0og(2n/^)} 

"1/2 

and note that ||x-(0 — 0)| l*^ < |0 — 0| |2A,„ax(STT), since under the assump- 
tions 9 has the same support as 9. Using (jl5p and the union bound, we have 
that P[^3] >l-S. 

We will also use the event 

Ai = {max {9 - 9y^,,T^^ - 9) < A^-,,^ii:TT)\\9 - 9\\l log{2n/S).} 

ie[n] 
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Setting u — K An(a^{'STT)\\S — ^||2-\/log(2ri/(5) in ([T8| and applying the union 
bound, we obtain that P[.44] > 1 — S. 
FinaUy, define the event 

A = {A^ax(ri~'X^XT) < 3A,„ax(STT)}. 

Similar to the proof of Lemma [TTl we have that P[y^5] > 1 — Sioin large enough 
so that e(?T.,(5) < 1, with e{n,S) defined in Lemma |8l 

In the following analysis, we condition on the event Ai n ^3 n A^. 

The following decomposition 



(45) 



will be useful for establishing a bound on \\0w,s — l3w,s\\oc,- We investigate each 
of the terms separately. 

Let cr^(x, 0) — exp(x^0). First, we will need to control the deviation of 
tT^^(x, 9) from cr^^(x, 6). Using the Taylor expansion 



a2(xi,0) 

^ ^ ^'^-(x„0)(^-0) + (0-0)'7;(^-0) 



(t2(x„0) (x„ 0)90^ 

where ^ satisfies ||^ — 0||2 < ||^ — 0\\2- 
On the event Ai n ^3 , we have 



max 

i<£ [n] 



cr3(x,,6>) 90 
= max|exp(x:0)xK^-0)| 

ie[n] 

max 1x^(0 -0)1 

iG[n] 

< C7||0 - 0||2exp (c||0||2Vlog(2n/^)) log(2n/(5) 

1 /2 

where C = if Aniax ( S tt )■ 

Basic calculus gives us that % — [Tab\ab with 

_ exp(-x^0) 



(46) 
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On the event Ai H A4 

ma^ {6 - ey%{e - 6) < {ay2)K^A^,^i-ETT)\\e - e\\l\og{2n/S). (47) 

i£ [n] ^ ' 

Combining and (|T7|) . we have 

|||W2 - W2|||2 < g'^CWB - e\\2{\ + - 0||2/2)log(2n/5) (48) 

1 /2 

where C — /CAmax(STT)- With this, we have that 
|||D - DIII2 = |||n-iX'5(W2 „ w2)Xs|||2 

< |||n-iX'sXs|||2|||W2-W2|||2 (49) 

< a23A^ax(Sss)C||0 - 0||2(1 + - 0II2/2) log(2n/5). 

From (|49p and under the assumptions of Theorem, we have 

-^min (Dss) > Amin(Dss) - IIIDss - Dssib > A„i„/2 
for sufficiently large n. Combining the last two displays 

»— 1 T*— llll lllf\— 1 /"TV M-»— 111 



- Dsillb = ll|Dsi(D55 - Dss)D^^|||2 

< |||D^i|||2|||D5s-Dss|||2|||D^i|||2 



2 

< i'-'sS — D5s|2- 



(50) 



We are now ready to bound each term in (|45|) . For the first term, we have 

,-i||(Dgi-D^i)X'5(W2-W2),7||oo 

< n-i||(D^i - D^i)X^(W2 - W2)r,||2 

< n-V2|||D-i _ D5i|||2|||n-ix;.X5|||f IIIW2 - W^ibHrjIb 

<-^l^|||n-iX^Xs|||^/'|||W2-W2|||2||^||2 

< ^SL^^}^^S^^ssil^^^^ _ ^||2(i ^ ^11^ _ ^11^/2)2 V(2n/<5)||e||2 

2V3C^5-^(3Amax(S55))^/^ ' fl||2/i , r^wn a\\ /0^2 l 2/r, /c\ 

< ||6»-6»||2(l + C||6»-6/||2/2) log (2n/(5) 



mm 



(51) 

with probability at least 1— exp(— 2n). The last inequality follows from Lemma[3] 
with u = \/3- 
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Similarly, the second term in yields 



-i||(D^i-D^^)X'sW2,7|U 

<n-i||(Dsi-D^i)X'sW2r,||2 
2 



mm 



<-^^|||n-iX'<,X5|||2|||W2-W2|||2|||n-iX'<,W2X5|||^/'||6|b 



(52) 



6V3Ca^A.nax(Sss)i^y' 



^ "Vo^u maxV &.by max ||g _ ^||_^^^ + C||0 - 6>||2/2) log(2n/(5) 



min 



with probability at least 1 — exp(— 2n). 
For the third term, we have 

< -^—l\n-'X'sXs\ll^'l\W^-W%\M2 (53) 

< 3g^'^^^ax(Sss) ^^^ _ ^11^^^ ^ ^11^ _ log(2,V<^) 

with probability at least 1 — exp(— 2n). 

Finally we deal with the forth term in (pS)) . Proceeding as in the proof of 
we have that 



.-||D-X',W^,|U < r-^^^ (54) 



with probability 1 — S. 

Combining ((SH), and (IMl) we have that 



\\f^w,S — f3w,s\\oo = O(||0-0||2exp(Cyb^)log(n)) < A (55) 

with probability at least 1 — exp(— 2ri) — 56. This also shows that min^gs \/3j \ ^ 
A. Therefore, we have shown (|^^ and (jH]). 

Using Lemma |6] and Lemma f7.1[ we write Xj e M" as Xj = Xgrs + Ej, 
j G 5*^, with Ej having elements that are subgaussian with parameter Ky^'Ej^g. 

Denote g = I - WXs(X'sW2Xs)"^X;;5W the projection matrix. Then 

n-ix;.W(Wy - WX/3„,) - ^-^(Xsxs + EjO'WPi^^Wy 

= n-^E^-WP^ ,5Wdiag(exp(X6l/2))e 



32 



Conditioned on X, we have that 

P[max |Z' e| > t] < 2{p - s) exp ( ^ ,,„ ^ (56) 

using Lemma m and P3|) together with the union bound. 
We proceed to bound max^ggc HZjUj. Write 

Zj = n.-^mw - W)Pi_s(W - W)diag(exp(X6i/2)) 



+ n-^E^ WPi s(W - W)diag(exp(X6'/2)) 

Using (|48p . we have that 

||Zj||2 < n-i||E,||2(|||W - Wl^a + 2|||W - W|||2 + a) 



(57) 



< ^ /3if (maxS,|5)n-i/2(|||W - Wjl^a + 2|||W - W|||2 + a) 



(58) 



with probabihty 1 — {p — s) exp(— 2n) using Lemma [5] with the union bound. 
Therefore, we have that 



||n-ix;.W(Wy- WX/3^)||oo < /max ||Z,||2 log(2(p - s)/<5) 

This concludes the proof of the first part. 

Second part of Theorem follows similarly to the proof of Theorem [1] We 
have already shown 

~ fiw,s = n-^D^^X'sWe + Op(l), 
and the rest follows as in Theorem [T] 
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